BUG: Fix TypeError in json_normalize with non-str meta key and record_path #63028
Hi reviewers,

P.S. While working on this fix, I noticed that the type hint for the [...]

My PR ensures that non-string keys (like the int key in our test case) are handled consistently, whether record_path is specified or not. This aligns with the existing behavior when record_path=None, which already supports non-string keys (mirroring pd.DataFrame's ability to have non-string column names).

I've kept this PR scoped strictly to fixing the TypeError and ensuring consistent behavior. Would a separate follow-up issue or PR be welcome to discuss updating the type hint (perhaps to something like Hashable or Any) to match this behavior? Just wanted to bring it to your attention.

Thanks!
- Fix bug in ``on_bad_lines`` callable when returning too many fields: now emits
  ``ParserWarning`` and truncates extra fields regardless of ``index_col`` (:issue:`61837`)
- Bug in :func:`pandas.json_normalize` inconsistently handling non-dict items in ``data`` when ``max_level`` was set. The function will now raise a ``TypeError`` if ``data`` is a list containing non-dict items (:issue:`62829`)
- Bug in :func:`pandas.json_normalize` raising ``TypeError`` when ``meta`` contained a non-string key (e.g., ``int``) and ``record_path`` was specified, which was inconsistent with the behavior when ``record_path`` was ``None`` (:issue:`63019`)
The docs state meta should be a string or list of strings.
https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html
Why are we supporting non-strings?
When record_path is None: pd.json_normalize([{'a': 1, 12: 'val'}], meta=[12]) already works perfectly. It correctly creates an integer column named 12, which is very useful and consistent with pd.DataFrame itself supporting non-string column names.
When record_path is set: As this issue shows, the exact same call (meta=[12]) suddenly fails with a TypeError simply because record_path was added.
This felt like a clear bug. My PR doesn't introduce new support for non-strings; it just fixes the TypeError so the function behaves consistently with itself, whether record_path is used or not.
It seemed better to fix this inconsistency (Path 1) rather than introduce a new breaking change to remove the existing, undocumented support from Path 2 (e.g., by adding a TypeError to the record_path=None case).
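The two call paths described above can be compared side by side with a small probe. This is only a sketch: the `probe` helper is hypothetical (it is not part of pandas or of this PR), the sample data is illustrative, and the result for each path depends on the installed pandas version, so no expected output is shown.

```python
import pandas as pd

# Illustrative records with an int key (12) next to ordinary string keys.
data = [
    {"a": 1, 12: "meta_value_1", "nested": [{"b": 2, "c": 3}]},
    {"a": 6, 12: "meta_value_2", "nested": [{"b": 7, "c": 8}]},
]


def probe(record_path):
    """Hypothetical helper: call json_normalize with an int meta key and report the outcome."""
    try:
        df = pd.json_normalize(data, record_path=record_path, meta=[12, "a"])
    except TypeError as exc:
        return ("TypeError", str(exc))
    # Report each column label together with the name of its Python type.
    return ("ok", [(col, type(col).__name__) for col in df.columns])


for path in (None, "nested"):
    print(f"record_path={path!r}: {probe(path)}")
```

Printing the type of each column label alongside the label makes it easy to see whether a given path preserves the int key or converts it.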
I'm negative on expanding the scope of json_normalize to handle invalid JSON data.
> My PR doesn't introduce new support for non-strings; it just fixes the TypeError so the function behaves consistently with itself, whether record_path is used or not.
This is supporting non-strings.
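For context on the "invalid JSON" point: JSON object keys are always strings, so an int key can exist in a Python dict but not in a JSON document. A minimal illustration using only the standard-library `json` module (the `record` dict is made up for this example):

```python
import json

# 12 is an int key: legal in a Python dict, but not representable as a JSON object key.
record = {"a": 1, 12: "meta_value_1"}

encoded = json.dumps(record)   # the encoder coerces the int key to the string "12"
decoded = json.loads(encoded)  # decoding yields string keys only

print(encoded)        # {"a": 1, "12": "meta_value_1"}
print(list(decoded))  # ['a', '12']
```

Data that has actually round-tripped through JSON therefore never carries a non-string key; such keys only appear when the dicts are built directly in Python.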
You are correct that the docs specify str, but the function's actual behavior already differs from the docs.
When record_path=None, json_normalize already works perfectly with non-string keys (like int). This is the existing behavior my PR is based on.
My PR is just a bug fix to make the function behave consistently, fixing the TypeError that only happens when record_path is added.
import pandas as pd

data = [
    {
        "a": 1,
        12: "meta_value_1",  # int key
        "nested": [{"b": 2, "c": 3}],
    },
    {
        "a": 6,
        12: "meta_value_2",
        "nested": [{"b": 7, "c": 8}],
    },
]

df = pd.json_normalize(
    data,
    record_path=None,
    meta=[12, "a"],
)
print(df)
print(type(df.columns[1]))
doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.